where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(C_{xx}^{-1/2} \times C_{xy} \times C_{yy}^{-1/2}\), that is, the first columns of \(U\) and \(V\).
Proof:
Proposition
A sequence of canonical components of \(C_{xy}\) can be obtained from the sequence of (extended) left and right singular vectors of \(C_{xy}\) with respect to \(C_{xx}\) and \(C_{yy}\).
Proof:
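As a numerical illustration of the two propositions above, the following sketch computes the singular value decomposition of \(C_{xx}^{-1/2} \times C_{xy} \times C_{yy}^{-1/2}\) on simulated data and compares the singular values with the output of cancor(). The simulated matrices, the helper inv_sqrt() and the back-transformation \(a_k = C_{xx}^{-1/2} u_k\), \(b_k = C_{yy}^{-1/2} v_k\) are assumptions made for this illustration, not part of the statements themselves.

```r
set.seed(1)
n <- 200
# Simulated, centered data blocks (assumption: any full-rank data would do)
X <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
Y <- scale(cbind(X[, 1] + rnorm(n), X[, 2] - X[, 3] + rnorm(n)),
           center = TRUE, scale = FALSE)

# Hypothetical helper: inverse square root of a symmetric positive definite matrix
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

Cxx <- crossprod(X) / (n - 1)
Cyy <- crossprod(Y) / (n - 1)
Cxy <- crossprod(X, Y) / (n - 1)

# SVD of the whitened cross-covariance matrix
s <- svd(inv_sqrt(Cxx) %*% Cxy %*% inv_sqrt(Cyy))

# Back-transform singular vectors into coefficient vectors
A <- inv_sqrt(Cxx) %*% s$u
B <- inv_sqrt(Cyy) %*% s$v

rho_svd    <- s$d                            # singular values
rho_cancor <- cancor(X, Y)$cor               # canonical correlations from cancor()
rho_direct <- diag(cor(X %*% A, Y %*% B))    # correlations of paired linear combinations

all.equal(rho_svd, rho_cancor)
all.equal(rho_svd, rho_direct)
```

The three vectors should agree up to numerical precision, which is what the propositions predict for the leading pair and for the subsequent pairs.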
Proposition
Let \(H_X\) (resp. \(H_Y\)) be the orthogonal projection matrix onto the linear space spanned by the columns of \(X\) (resp. \(Y\)).
Canonical correlations \(\rho_1 \geq \ldots \geq \rho_s \geq \ldots\) are the positive square roots of the eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_s \geq \ldots\) of \(H_X \times H_Y\) (which are the same as those of \(H_Y \times H_X\)): \(\rho_s = \sqrt{\lambda_s}\).
Vectors \(U^1, \ldots, U^{p_1}\) are the standardized eigenvectors corresponding to the decreasing eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_{p_1}\) of \(H_X \times H_Y\).
Vectors \(V^1, \ldots, V^{p_2}\) are the standardized eigenvectors corresponding to the decreasing eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_{p_2}\) of \(H_Y \times H_X\).
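The statement about projections can be checked numerically. The sketch below uses simulated data and a hypothetical helper proj() that builds an orthogonal projector from a QR factorisation. Since \(H_X \times H_Y\) is not symmetric, the code works with \(H_X \times H_Y \times H_X\), which has the same non-zero eigenvalues and the same corresponding eigenvectors; this substitution is a numerical convenience chosen for this illustration.

```r
set.seed(2)
n <- 100
# Simulated, centered data blocks (assumption made for this illustration)
X <- scale(matrix(rnorm(n * 4), n, 4), scale = FALSE)
Y <- scale(X[, 1:3] %*% matrix(rnorm(9), 3, 3) + matrix(rnorm(n * 3), n, 3),
           scale = FALSE)

# Hypothetical helper: orthogonal projector onto the column space of M
proj <- function(M) {
  q <- qr(M)
  Q <- qr.Q(q)[, seq_len(q$rank), drop = FALSE]
  tcrossprod(Q)   # Q %*% t(Q)
}

HX <- proj(X)
HY <- proj(Y)

# Symmetric matrix sharing its non-zero eigenpairs with HX %*% HY
ev <- eigen(HX %*% HY %*% HX, symmetric = TRUE)
lambda <- ev$values[1:3]   # min(p1, p2) = 3 non-zero eigenvalues

cc <- cancor(X, Y)

# Square roots of the eigenvalues versus canonical correlations
all.equal(sqrt(lambda), cc$cor)

# The leading eigenvector is collinear with the first canonical variate of X
U1 <- ev$vectors[, 1]
abs(as.numeric(cor(U1, X %*% cc$xcoef[, 1])))
```

The last line should return a value numerically equal to 1: eigenvectors are only defined up to sign and scale, hence the absolute correlation.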
Canonical Correlation Analysis (CCA) in R
cancor() from base R (package stats)
Function cancor(x, y, xcenter=TRUE, ycenter=TRUE) computes the canonical correlations between two data matrices x and y. Henceforth we assume that the columns of x and y are centered. Matrices x and y have the same number \(n\) of rows. x (resp. y) has p1 (resp. p2) columns.
Canonical correlation analysis seeks linear combinations of the y variables that are well explained by linear combinations of the x variables. The relationship is symmetric, since "well explained" is measured by correlations.
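The result is a list of five components:
cor the canonical correlations.
xcoef estimated coefficients for the x variables.
ycoef estimated coefficients for the y variables.
xcenter the values used to center the x variables.
ycenter the values used to center the y variables.
Under the centering assumption above, xcenter and ycenter are zero vectors.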
The next example is taken from the documentation. Use ?LifeCycleSavings to get more information on the dataset.
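The code chunk is not reproduced in this text; a minimal sketch, adapted from the example section of ?cancor, could read as follows. The object name res.cca is the one used in the discussion; pop and oec are the names used in the documentation example.

```r
# Two blocks of variables from the LifeCycleSavings data frame
pop <- LifeCycleSavings[, 2:3]      # pop15, pop75
oec <- LifeCycleSavings[, -(2:3)]   # sr, dpi, ddpi

res.cca <- cancor(pop, oec)

res.cca$cor     # canonical correlations, in decreasing order
res.cca$xcoef   # coefficients for pop15, pop75 (one column per canonical variate)
res.cca$ycoef   # coefficients for sr, dpi, ddpi
```

The signs of the coefficient columns are arbitrary: flipping the sign of a pair of columns in xcoef and ycoef leaves the correlations unchanged.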
This tells us that the highest possible linear correlation between a linear combination of pop15, pop75 and a linear combination of sr, dpi, ddpi is res.cca$cor[1]. The coefficients of the corresponding linear combinations can be found in the first columns of components xcoef and ycoef.
Question
Check that the different components of the output of cancor() satisfy all properties they should satisfy.
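One possible way to carry out this check is sketched below on the LifeCycleSavings example. The properties tested (centers equal to column means, orthonormal canonical variates within each block, canonical correlations on the diagonal of the cross-product) reflect the scaling used by cancor(), where canonical variates have unit sum of squares rather than unit variance. The object names Xc, Yc, U and V are introduced for this illustration.

```r
pop <- as.matrix(LifeCycleSavings[, 2:3])
oec <- as.matrix(LifeCycleSavings[, -(2:3)])
res.cca <- cancor(pop, oec)

# 1. The centers are the column means
all.equal(res.cca$xcenter, colMeans(pop))
all.equal(res.cca$ycenter, colMeans(oec))

# Center both blocks, then form the canonical variates
Xc <- sweep(pop, 2, res.cca$xcenter)
Yc <- sweep(oec, 2, res.cca$ycenter)
U <- Xc %*% res.cca$xcoef
V <- Yc %*% res.cca$ycoef

# 2. Canonical variates are orthonormal within each block
all.equal(crossprod(U), diag(2), check.attributes = FALSE)
all.equal(crossprod(V), diag(3), check.attributes = FALSE)

# 3. Across blocks: canonical correlations on the diagonal, zeros elsewhere
all.equal(crossprod(U, V), cbind(diag(res.cca$cor), 0), check.attributes = FALSE)

# 4. The same correlations are recovered with cor()
all.equal(diag(cor(U, V[, 1:2])), res.cca$cor, check.attributes = FALSE)
```

Each call to all.equal() should return TRUE.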
Canonical correlations analysis (CCA) is an exploratory statistical method to highlight correlations between two data sets acquired on the same experimental units. The cancor() function in R (R Development Core Team 2007) performs the core of computations but further work was required to provide the user with additional tools to facilitate the interpretation of the results.
As in PCA, CA, MCA, several kinds of graphical representations can be displayed from the results of CCA:
a barplot of the squared canonical correlations (which tells us about the low rank approximations of \(H_X \times H_Y\))
scatter plots for the initial variables \(X^j\) and \(Y^k\) (a kind of correlation circle)
scatter plots for the individuals (rows)
biplots
Applications
Question
Load nutrimouse dataset from CCA.
Insert the 4 elements of list nutrimouse in the global environment (see list2env())
\(H_X\times H_Y\) has 21 eigenvalues equal to \(1\). As the subspaces spanned by the columns of gene and lipid have dimensions at most 40 and 21 respectively, and the larger one contains the smaller one here, \(H_X\times H_Y\) coincides with the orthogonal projection of \(\mathbb{R}^{40}\) onto the smaller subspace.
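The phenomenon does not depend on the nutrimouse data. The sketch below reproduces it with simulated, uncentered Gaussian matrices of the same shapes (40 rows, 120 and 21 columns); the matrices and the helper proj() are assumptions made for this illustration.

```r
set.seed(3)
n <- 40
X <- matrix(rnorm(n * 120), n, 120)   # stand-in for gene:  40 x 120
Y <- matrix(rnorm(n * 21),  n, 21)    # stand-in for lipid: 40 x 21

# Hypothetical helper: orthogonal projector onto the column space of M
proj <- function(M) {
  q <- qr(M)
  Q <- qr.Q(q)[, seq_len(q$rank), drop = FALSE]
  tcrossprod(Q)
}

HX <- proj(X)   # the 120 columns span the whole of R^40, so HX is the identity
HY <- proj(Y)   # projector onto a 21-dimensional subspace

all.equal(HX, diag(n))

# Same non-zero eigenvalues as HX %*% HY, computed on a symmetric matrix
lambda <- eigen(HX %*% HY %*% HX, symmetric = TRUE)$values
sum(abs(lambda - 1) < 1e-8)   # number of eigenvalues equal to 1
```

With more columns than rows in one block, every canonical correlation equals 1 and the analysis carries no information, which motivates working with a subsample of columns.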
Question
Sample 10 columns from gene and lipid and repeat the operation
res.cca$cor |>
  as_tibble() |>
  gt::gt() |>
  gt::fmt_scientific() |>
  gt::tab_caption("Canonical correlations between `gene` columns of nutrimouse and `lipid` columns")
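Canonical correlations between `gene` columns of nutrimouse and `lipid` columns
value
\(9.62 \times 10^{-1}\)
\(8.82 \times 10^{-1}\)
\(7.90 \times 10^{-1}\)
\(7.35 \times 10^{-1}\)
\(6.96 \times 10^{-1}\)
\(5.66 \times 10^{-1}\)
\(5.09 \times 10^{-1}\)
\(2.67 \times 10^{-1}\)
\(1.58 \times 10^{-1}\)
\(6.62 \times 10^{-2}\)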
library(dplyr)    # mutate(), as_tibble()
library(ggplot2)

res.cca$cor |>
  as_tibble() |>
  mutate(
    PC = as.factor(seq_along(value)),   # index of the canonical pair
    eig = value^2,                      # squared canonical correlation
    cumulative = cumsum(eig)
  ) |>
  ggplot() +
  aes(x = PC, y = eig) +
  geom_col(fill = "white", color = "black") +
  theme_minimal() +
  labs(
    title = "Squared Canonical Correlations",
    subtitle = "sample of 10 genes and 10 lipids",
    caption = "nutrimouse data"
  )